Skip to content

refactor(integration-tests): Deprecate IntegrationTestLogs and tests.utils.config.IntegrationTestPathConfig.#2241

Open
quinntaylormitchell wants to merge 6 commits intoy-scope:mainfrom
quinntaylormitchell:testing-migration-datasets
Open

refactor(integration-tests): Deprecate IntegrationTestLogs and tests.utils.config.IntegrationTestPathConfig.#2241
quinntaylormitchell wants to merge 6 commits intoy-scope:mainfrom
quinntaylormitchell:testing-migration-datasets

Conversation

@quinntaylormitchell
Copy link
Copy Markdown
Collaborator

@quinntaylormitchell quinntaylormitchell commented May 4, 2026

Description

This PR migrates the testing suite with respect to the new designs implemented in #2181 (IntegrationTestDataset) and #2106 (tests.utils.classes.IntegrationTestPathConfig). The IntegrationTestLogs and tests.utils.config.IntegrationTestPathConfig objects are now deprecated, and all related flows have been modified accordingly.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ran uv run pytest -m 'package' and uv run pytest -m 'core'; all tests pass.

Summary by CodeRabbit

  • Tests

    • Updated integration test infrastructure to use structured dataset fixtures for improved test organization and maintainability.
  • Chores

    • Reorganized test data configuration and fixture structure for better dataset management.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner May 4, 2026 18:28
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Refactored integration test fixture system to introduce a new SampleDataset abstraction with corresponding fixtures (json_multifile, text_multifile, text_singlefile). Reorganized path configuration to use test_cache_dir instead of test_root_dir, and updated all integration tests to use the new typed dataset fixtures instead of parameterized string-based logs.

Changes

Dataset Abstraction and Fixtures

Layer / File(s) Summary
Data Shape
integration-tests/tests/utils/classes.py
Introduced SampleDatasetMetadata and SampleDataset classes to replace older IntegrationTestDataset; added downloaded_logs_dir property to IntegrationTestPathConfig.
Test Data
integration-tests/tests/data/text_singlefile/...
Added new text_singlefile dataset directory with simple.txt log file (11 entries) and metadata.json configuration.
Fixture Definitions
integration-tests/tests/fixtures/sample_datasets.py
Created new session-scoped fixtures json_multifile, text_multifile, and text_singlefile returning SampleDataset instances configured with corresponding dataset root directories.
Fixture Registration
integration-tests/tests/conftest.py
Updated pytest_plugins to register tests.fixtures.sample_datasets and remove references to old fixture modules.
Path Configuration
integration-tests/tests/fixtures/path_configs.py, integration-tests/tests/utils/config.py
Refactored IntegrationTestPathConfig import location and constructor signature; updated PackagePathConfig, CompressionTestPathConfig, and ConversionTestPathConfig to use test_cache_dir instead of test_root_dir.
Downloaded Datasets
integration-tests/tests/fixtures/downloaded_datasets.py
Introduced DownloadedDataset dataclass and updated hive_24hr fixture and _download_and_extract_gzip_dataset helper to return the new type with computed tarball and extraction paths.
Test Updates
integration-tests/tests/package_tests/clp_json/test_clp_json.py, integration-tests/tests/package_tests/clp_text/test_clp_text.py
Updated test function signatures to accept typed json_multifile and text_multifile fixtures; replaced hardcoded paths and values with dataset-driven configuration and metadata.
Identity Transform Tests
integration-tests/tests/test_identity_transformation.py
Removed parameterized text_datasets and json_datasets module variables; refactored test_clp_identity_transform and test_clp_s_identity_transform to accept text_multifile and json_multifile fixtures respectively; simplified test flow to derive dataset names from fixtures.
Log Converter Test
integration-tests/tests/test_log_converter.py
Removed parameterized test_logs_fixture parameter; updated test_log_converter_transform to accept text_singlefile fixture and derive log event count and paths from dataset metadata.
Test Data Cleanup
integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/*, integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/*, **/README.md
Removed existing JSONL and text log entries from multi-file datasets and deleted associated README sections documenting log metadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: deprecating IntegrationTestLogs and IntegrationTestPathConfig in favour of new implementations, which is reflected throughout the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integration-tests/tests/test_log_converter.py (1)

86-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Address substring ambiguity and non-deterministic ordering in event validation.

Two substantive issues:

  1. Substring false-positive for TEST1. The test data contains messages ending with TEST1 through TEST11. The check f"TEST{idx + 1}" not in message at idx=0 tests for "TEST1", which is a substring of both "TEST10" and "TEST11". If clp-s returns TEST10 or TEST11 at the first position, the validation silently passes despite checking the wrong event.

  2. Non-deterministic ordering for equal timestamps. All 11 log events share the same timestamp (1427089710122 ms). The clp-s search code iterates through messages sequentially as stored in the archive, but there is no explicit guarantee that events with identical timestamps will be returned in insertion order. Any change to the compression or search pipeline could alter this ordering without notice.

Minimum fix — use .endswith() to eliminate substring matches:

Proposed fix
     for idx, line in enumerate(lines):
         event = json.loads(line)
         message = event.get("message", "")
-        if f"TEST{idx + 1}" not in message:
+        if not message.endswith(f"TEST{idx + 1}"):
             pytest.fail(
                 f"Expected 'TEST{idx + 1}' in message of event {idx + 1}, but got: {event}"
             )

This addresses the substring issue but retains the ordering assumption. If clp-s does not guarantee insertion order for equal timestamps, replace with an order-independent set comparison (as suggested in the original diff proposal).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@integration-tests/tests/test_log_converter.py` around lines 86 - 90, The
current loop over lines uses substring containment (if f"TEST{idx + 1}" not in
message) which yields false-positives for TEST1 matching TEST10/11 and also
assumes deterministic ordering for equal timestamps; update the validation in
the loop that processes event/message (variables event, message, idx) to use
message.endswith(f"TEST{idx + 1}") to avoid substring matches, and for
robustness against non-deterministic ordering of identical-timestamp events
replace the positional checks with an order-independent comparison: collect all
extracted TEST labels from messages and compare that set/list against the
expected set of {"TEST1",..."TEST11"} (or assert sorted lists) instead of
relying on sequential idx matching.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@integration-tests/tests/data/text_singlefile/metadata.json`:
- Around line 5-6: Update the timestamp values in metadata.json so they
accurately reflect the log events: change the "begin_ts" and "end_ts" JSON
fields from 1427089710000 to 1427089710122 (the actual event epoch milliseconds
for 2015-03-23 05:48:30,122) so the metadata window matches the timestamps in
simple.txt.

In `@integration-tests/tests/fixtures/path_configs.py`:
- Line 38: PackagePathConfig is being constructed with
integration_tests_project_root as test_root_dir which causes
PackagePathConfig.__post_init__ to mkdir temp_config_dir under the project root;
change the constructor call to pass integration_test_path_config.test_cache_dir
(the same cache dir used by CompressionTestPathConfig and
ConversionTestPathConfig) so temp_config_dir is created under the dedicated
cache directory and aligns behavior across PackagePathConfig,
CompressionTestPathConfig and ConversionTestPathConfig.

In `@integration-tests/tests/test_log_converter.py`:
- Around line 37-40: The current calculation of num_log_events opens files
directly in a generator expression, risking leaked file handles; change the
logic to iterate over text_singlefile.metadata.file_names with an explicit with
context manager when opening (text_singlefile.logs_path /
file_name).open(encoding="utf-8") so each file is closed deterministically,
e.g., open each file inside a with block and sum lines (sum(1 for _ in fh)) into
num_log_events; update the code around the num_log_events computation to use
this pattern.

---

Outside diff comments:
In `@integration-tests/tests/test_log_converter.py`:
- Around line 86-90: The current loop over lines uses substring containment (if
f"TEST{idx + 1}" not in message) which yields false-positives for TEST1 matching
TEST10/11 and also assumes deterministic ordering for equal timestamps; update
the validation in the loop that processes event/message (variables event,
message, idx) to use message.endswith(f"TEST{idx + 1}") to avoid substring
matches, and for robustness against non-deterministic ordering of
identical-timestamp events replace the positional checks with an
order-independent comparison: collect all extracted TEST labels from messages
and compare that set/list against the expected set of {"TEST1",..."TEST11"} (or
assert sorted lists) instead of relying on sequential idx matching.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6831764d-9ae2-474c-9fac-0b8555d2ba7c

📥 Commits

Reviewing files that changed from the base of the PR and between 6b67f6c and 78fd082.

📒 Files selected for processing (23)
  • integration-tests/tests/conftest.py
  • integration-tests/tests/data/text_singlefile/logs/simple.txt
  • integration-tests/tests/data/text_singlefile/metadata.json
  • integration-tests/tests/fixtures/datasets.py
  • integration-tests/tests/fixtures/integration_test_logs.py
  • integration-tests/tests/fixtures/path_configs.py
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/README.md
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-08.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-09.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-11.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-19.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-21.jsonl
  • integration-tests/tests/package_tests/clp_json/test_clp_json.py
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/README.md
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day01.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day04.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day07.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day10.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day13.txt
  • integration-tests/tests/package_tests/clp_text/test_clp_text.py
  • integration-tests/tests/test_identity_transformation.py
  • integration-tests/tests/test_log_converter.py
  • integration-tests/tests/utils/config.py
💤 Files with no reviewable changes (14)
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-08.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-19.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/README.md
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day01.txt
  • integration-tests/tests/conftest.py
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/README.md
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-21.jsonl
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-11.jsonl
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day13.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day04.txt
  • integration-tests/tests/package_tests/clp_json/data/json-multifile/logs/sts-135-2011-07-09.jsonl
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day10.txt
  • integration-tests/tests/package_tests/clp_text/data/text-multifile/logs/apollo-17_day07.txt
  • integration-tests/tests/fixtures/integration_test_logs.py

Comment thread integration-tests/tests/data/text_singlefile/metadata.json Outdated
Comment thread integration-tests/tests/fixtures/path_configs.py Outdated
Comment thread integration-tests/tests/test_log_converter.py Outdated
Comment thread integration-tests/tests/fixtures/integration_test_logs.py
Comment thread integration-tests/tests/fixtures/path_configs.py Outdated
@@ -5,11 +5,10 @@

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to this file need further discussion. We want to support both small and large datasets in integration tests.
For these tests specifically, we intentionally use large datasets to ensure compression and decompression remain lossless at scale.
Switching them to the new local datasets (or simply adding local datasets) would also be out of scope for this PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussion in progress

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored the large datasets as discussed.

@quinntaylormitchell quinntaylormitchell requested review from Bill-hbrhbr and removed request for Bill-hbrhbr May 6, 2026 02:04
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
integration-tests/tests/conftest.py (1)

12-17: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Register tests.fixtures.downloaded_datasets in pytest_plugins or remove the unused fixtures.

The hive_24hr and postgresql fixtures defined in tests/fixtures/downloaded_datasets.py are not registered in pytest_plugins and are not referenced anywhere in the codebase. Either add the module to the plugin list if tests will use these fixtures, or remove the unused fixture definitions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@integration-tests/tests/conftest.py` around lines 12 - 17, The pytest_plugins
list in conftest.py is missing the module that defines the hive_24hr and
postgresql fixtures; either add "tests.fixtures.downloaded_datasets" to the
pytest_plugins array so pytest registers those fixtures (pytest_plugins
variable) or delete the unused fixture definitions from
tests/fixtures/downloaded_datasets.py (fixtures named hive_24hr and postgresql)
if they aren’t needed; update accordingly and run tests to ensure no
import/fixture-not-found errors remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@integration-tests/tests/fixtures/downloaded_datasets.py`:
- Around line 36-47: In DownloadedDataset.__post_init__, add a one-line comment
explaining that object.__setattr__ is intentionally used to bypass frozen=True
so derived/normalized fields (name, tarball_path, extraction_dir) can be set
after construction; then either (A) change the dataclass API to make the raw
constructor value an InitVar (e.g., name_input) or mark name as init=False and
compute/assign the normalized name there, or (B) if keeping the current
signature, document in the class docstring that the constructor argument for
name is normalized (stripped) and may differ from the supplied value so callers
are not surprised. Ensure the comment references __post_init__,
object.__setattr__, and the fields name/tarball_path/extraction_dir.
- Line 95: Update the docstring for the function that returns a
DownloadedDataset so the article is correct: change the return description from
"An DownloadedDataset instance providing metadata for the downloaded logs." to
"A DownloadedDataset instance providing metadata for the downloaded logs." —
locate the docstring referencing DownloadedDataset (the function that returns a
DownloadedDataset) and fix the ":return:" line accordingly.

---

Outside diff comments:
In `@integration-tests/tests/conftest.py`:
- Around line 12-17: The pytest_plugins list in conftest.py is missing the
module that defines the hive_24hr and postgresql fixtures; either add
"tests.fixtures.downloaded_datasets" to the pytest_plugins array so pytest
registers those fixtures (pytest_plugins variable) or delete the unused fixture
definitions from tests/fixtures/downloaded_datasets.py (fixtures named hive_24hr
and postgresql) if they aren’t needed; update accordingly and run tests to
ensure no import/fixture-not-found errors remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c0ca1f1c-ca63-45a1-9411-e6b711ad1757

📥 Commits

Reviewing files that changed from the base of the PR and between fed048c and 3861d54.

📒 Files selected for processing (9)
  • integration-tests/tests/conftest.py
  • integration-tests/tests/data/README.md
  • integration-tests/tests/fixtures/downloaded_datasets.py
  • integration-tests/tests/fixtures/sample_datasets.py
  • integration-tests/tests/package_tests/clp_json/test_clp_json.py
  • integration-tests/tests/package_tests/clp_text/test_clp_text.py
  • integration-tests/tests/test_identity_transformation.py
  • integration-tests/tests/test_log_converter.py
  • integration-tests/tests/utils/classes.py

Comment thread integration-tests/tests/fixtures/downloaded_datasets.py
Comment thread integration-tests/tests/fixtures/downloaded_datasets.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants